Goto

Collaborating Authors

 chinese word


NushuRescue: Revitalization of the Endangered Nushu Language with AI

arXiv.org Artificial Intelligence

The preservation and revitalization of endangered and extinct languages is a meaningful endeavor, conserving cultural heritage while enriching fields like linguistics and anthropology. However, these languages are typically low-resource, making their reconstruction labor-intensive and costly. This challenge is exemplified by Nushu, a rare script historically used by Yao women in China for self-expression within a patriarchal society. To address this challenge, we introduce NushuRescue, an AI-driven framework designed to train large language models (LLMs) on endangered languages with minimal data. NushuRescue automates evaluation and expands target corpora to accelerate linguistic revitalization. As a foundational component, we developed NCGold, a 500-sentence Nushu-Chinese parallel corpus, the first publicly available dataset of its kind. Leveraging GPT-4-Turbo, with no prior exposure to Nushu and only 35 short examples from NCGold, NushuRescue achieved 48.69% translation accuracy on 50 withheld sentences and generated NCSilver, a set of 98 newly translated modern Chinese sentences of varying lengths. A sample of both NCGold and NCSilver is included in the Supplementary Materials. Additionally, we developed FastText-based and Seq2Seq models to further support research on Nushu. NushuRescue provides a versatile and scalable tool for the revitalization of endangered languages, minimizing the need for extensive human input.


Deep Learning, NLP, and Representations - colah's blog

#artificialintelligence

In the last few years, deep neural networks have dominated pattern recognition. They blew the previous state of the art out of the water for many computer vision tasks. Voice recognition is also moving that way. But despite the results, we have to wonder… why do they work so well? In doing so, I hope to make accessible one promising answer as to why deep neural networks work. I think it's a very elegant perspective. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem.


cw2vec: Learning Chinese Word Embeddings with Stroke n-gram Information

AAAI Conferences

We propose cw2vec, a novel method for learning Chinese word embeddings. It is based on our observation that exploiting stroke-level information is crucial for improving the learning of Chinese word embeddings. Specifically, we design a minimalist approach to exploit such features, by using stroke n-grams, which capture semantic and morphological level information of Chinese words. Through qualitative analysis, we demonstrate that our model is able to extract semantic information that  cannot be captured by existing methods. Empirical results on the word similarity, word analogy, text classification and named entity recognition tasks show that the proposed approach consistently outperforms state-of-the-art approaches such as word-based word2vec and GloVe, character-based CWE, component-based JWE and pixel-based GWE.


Deep Learning, NLP, and Representations - colah's blog

#artificialintelligence

In the last few years, deep neural networks have dominated pattern recognition. They blew the previous state of the art out of the water for many computer vision tasks. Voice recognition is also moving that way. But despite the results, we have to wonder… why do they work so well? In doing so, I hope to make accessible one promising answer as to why deep neural networks work. I think it's a very elegant perspective. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem.


Deep Learning, NLP, and Representations - colah's blog

#artificialintelligence

In the last few years, deep neural networks have dominated pattern recognition. They blew the previous state of the art out of the water for many computer vision tasks. Voice recognition is also moving that way. But despite the results, we have to wonder… why do they work so well? In doing so, I hope to make accessible one promising answer as to why deep neural networks work. I think it's a very elegant perspective. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem.


CHIME: An Efficient Error-Tolerant Chinese Pinyin Input Method

AAAI Conferences

Chinese Pinyin input methods are very important for Chinese language processing. In many cases, users may make typing errors. For example, a user wants to type in "shenme" (什么, meaning "what" in English) but may type in "shenem" instead. Existing Pinyin input methods fail in converting such a Pinyin sequence with errors to the right Chinese words. To solve this problem, we developed an efficient error-tolerant Pinyin input method called "CHIME'' that can handle typing errors. By incorporating state-of-the-art techniques and language-specific features, the method achieves a better performance than state-of-the-art input methods. It can efficiently find relevant words in milliseconds for an input Pinyin sequence.